Abstract

It is a long-standing open problem whether there always exists a compression scheme 
whose size is of the order of the Vapnik-Chervonenkis (VC) dimension d. Recently, compression
schemes of size exponential in d have been found for any concept class of VC
dimension d. Previously, compression schemes of size d have been given for maximum 
classes, which are special concept classes whose size equals an upper bound due to Sauer- 
Shelah. We consider a generalization of maximum classes called extremal classes. Their
definition is based on a powerful generalization of the Sauer-Shelah bound called the Sandwich
Theorem, which has been studied in several areas of combinatorics and computer
science. The key result of the paper is a construction of a sample compression scheme 
for extremal classes of size equal to their VC dimension. We also give a number of open 
problems concerning the combinatorial structure of extremal classes and the existence of 
unlabeled compression schemes for them. 


1. Introduction 


Generalization and compression/simplification are two basic facets of “learning”. Generalization
concerns the expansion of existing knowledge and compression concerns simplifying
our explanations of it. In machine learning, compression and generalization are deeply 
related: learning algorithms perform compression and the ability to compress guarantees 
good generalization. 


A simple form of this connection is how Occam’s Razor (Blumer et al., 1987) is manifested
in machine learning: if the input sample can be compressed to a small number of
bits which encode a hypothesis consistent with the input sample, then good generalization
of this hypothesis is guaranteed. A more sophisticated notion of compression is given by
“sample compression schemes” (Littlestone and Warmuth, 1986). In these schemes the
input sample is compressed to a carefully chosen small subsample that encodes a hypothesis
consistent with the input sample. For example, a support vector machine can be seen as
compressing the original sample to the subset of support vectors, which represent a maximum
margin hyperplane that is consistent with the entire original sample.

What is the connection to generalization? In the Occam’s razor setting, the generalization
error decreases with the number of bits that are used to encode the output hypothesis.
Similarly for compression schemes, the generalization error decreases with the sample size. 

In the learning model considered here, the learner is given a sample consistent with an
unknown concept from a target concept class. From the given sample, the learner aims to
construct a hypothesis that yields a good generalization, i.e. a good approximation of the
unknown concept. A core question is what parameter of the concept class characterizes the
sample size required for good generalization. The Vapnik-Chervonenkis (VC) dimension
serves as such a parameter (Blumer et al., 1989), where the exact definition of generalization
underlying our discussion is specified by the Probably Approximately Correct (PAC) model
of learning (Valiant, 1984; Vapnik and Chervonenkis, 1971). We believe that the size of the
best compression scheme is an alternate parameter and has several additional advantages:

• Compression schemes frame many natural algorithms (e.g. support vector machines). 
This gives sample compression schemes a constructive flavor. 

• Unlike the VC dimension, the definition of sample compression schemes as well as the
fact that they yield low generalization error extends naturally to multi-label concept
classes (Samei et al., 2014). This is particularly interesting when the number of labels
is very large (or possibly infinite), because for that case there is no known combinatorial
parameter that characterizes the sample complexity in the PAC model (see
(Daniely and Shalev-Shwartz, 2014)). The size of the best sample compression scheme
is therefore a natural candidate for a universal parameter that characterizes the sample
complexity in the PAC model.


Previous work 


Littlestone and Warmuth (1986) defined sample compression schemes and showed that in
the PAC model of learning, the sample size required for learning grows linearly with the
size of the subsamples the scheme compresses to. They also posed the other direction
as an open question: Does every concept class have a compression scheme of size depending
only on its VC dimension? Later, Floyd and Warmuth (1995) and Warmuth (2003) refined
this question: Does every class of VC dimension $d$ have a sample compression scheme of
size $O(d)$?


Ben-David and Litman (1998) proved a compactness theorem for sample compression
schemes. It essentially says that the existence of compression schemes for infinite classes
follows¹ from the existence of such schemes for finite classes. Thus, it suffices to consider
only finite concept classes. Floyd and Warmuth (1995) constructed sample compression
schemes of size $\log |C|$ for every concept class $C$. More recently, Moran et al. (2015)
constructed sample compression schemes of size $\exp(d) \log\log |C|$, where $d = \mathrm{VCdim}(C)$.
Finally, Moran and Yehudayoff (2016) constructed a sample compression scheme of size
$\exp(d)$, resolving Littlestone and Warmuth’s question. Their compression scheme is based
on an earlier compression scheme by Freund (1995) and Freund and Schapire (2012), which was
defined in the context of boosting. This sample compression scheme is of variable size: it
compresses samples of size $m$ to subsamples of size $O(d \log m)$.

1. The proof of that theorem is however non-constructive.

For many natural and important families of concept classes, sample compression schemes
of size equal to the VC dimension were constructed, revealing connections between sample
compression schemes and other fields such as combinatorics, geometry, model theory,
and algebraic topology (e.g. Floyd (1989); Helmbold et al. (1992); Ben-David and Litman
(1998); Chernikov and Simon (2013); Rubinstein et al. (2009); Rubinstein and Rubinstein
(2012); Livni and Simon (2013)). Despite this rich body of work, the refined question
whether there exists a compression scheme whose size is equal to or linear in the VC
dimension remains open.


Floyd and Warmuth (1995) observed that in order to prove the conjecture it suffices to
consider only maximal classes (a class $C$ is maximal if no concept can be added without
increasing the VC dimension). Furthermore, they constructed sample compression schemes
of size d for every maximum class of VC dimension d. These classes are maximum in the
sense that their size equals an upper bound (due to Sauer-Shelah) on the size of any
concept class of VC dimension d. Later, Kuzmin and Warmuth (2007) and Rubinstein
and Rubinstein (2012) provided even more efficient and combinatorially elegant sample
compression schemes for maximum classes that are called unlabeled compression schemes
because the labels of the subsample are not needed to encode the output hypothesis.

One way to make progress on Floyd and Warmuth’s question is to extend
the optimal compression schemes for maximum classes to a more general family. In this
paper we consider a natural and rich generalization of maximum classes which are known by
the name extremal classes (or shattering extremal classes). Similar to maximum classes,
these classes are defined when a certain inequality known as the Sandwich Theorem is
tight. This inequality generalizes the Sauer-Shelah bound. The Sandwich Theorem as
well as extremal classes were discovered several times and independently by several groups
of researchers and in several contexts such as Functional Analysis (Pajor, 1985), Discrete
Geometry (Lawrence, 1983), Phylogenetic Combinatorics (Dress, 1997; Bandelt et al., 2006)
and Extremal Combinatorics (Bollobas et al., 1989; Bollobas and Radcliffe, 1995). Even
though a lot of knowledge regarding the structure of extremal classes has been accumulated,
the understanding of these classes is still considered incomplete by several authors (Bollobas
and Radcliffe, 1995; Greco, 1998; Ronyai and Meszaros, 2011).


Our results 

Our main result is a construction of a sample compression scheme of size d for every extremal
class of VC dimension d. When the concept class is maximum, our scheme specializes
to the compression scheme for maximum classes given in (Floyd and Warmuth, 1995).
Our generalized sample compression scheme for extremal classes is still easy to describe.
However, its analysis requires more combinatorics and heavily exploits the rich structure
of extremal classes. We also give
explicit examples of maximal classes that are extremal but not maximum (see Example 5).
This means that the compression scheme presented here is not implied by the previous
sample compression schemes for maximum classes, not even implicitly.

We also discuss a certain greedy peeling method for producing an unlabeled compression
scheme. Such schemes were first conjectured in (Kuzmin and Warmuth, 2007) and
later proven to exist for maximum classes (Rubinstein and Rubinstein, 2012). However the 
existence of such schemes for extremal classes remains open. We relate the existence of 
such schemes to basic open questions concerning the combinatorial structure of extremal 
classes. 


Organization 

In Section 2 we give some preliminary definitions and define extremal classes. We also
discuss some basic properties and give some examples of extremal classes which demonstrate
their generality over maximum classes. In Section 3 we give a labeled compression
scheme for any extremal class of VC dimension d. Finally, in Section 4 we relate unlabeled
compression schemes for extremal classes with basic open questions concerning extremal
classes.


2. Extremal Classes 
2.1 Preliminaries 


Concepts, concept classes, and the one-inclusion graph. A concept $c$ is a mapping
from some domain to $\{0,1\}$. We assume for the sake of simplicity that the domain of $c$
(denoted by $\mathrm{dom}(c)$) is finite and allow the case that $\mathrm{dom}(c) = \emptyset$. A concept $c$ can also
be viewed as a characteristic function of a subset of $\mathrm{dom}(c)$, i.e. for any domain point
$x \in \mathrm{dom}(c)$, $c(x) = 1$ iff $x \in c$. A concept class $C$ is a set of concepts with the same
domain (denoted by $\mathrm{dom}(C)$). A concept class can be represented by a binary table (see
Fig. 1), where the rows correspond to concepts and the columns to the elements of $\mathrm{dom}(C)$.
Whenever the elements in $\mathrm{dom}(C)$ are clear from the context, we represent concepts
as bit strings of length $|\mathrm{dom}(C)|$ (see Fig. 1).

The concept class $C$ can also be represented as a subgraph of the Boolean hypercube
with $|\mathrm{dom}(C)|$ dimensions. Each dimension corresponds to a particular domain element,
the vertices are the concepts in $C$ and two concepts are connected with an edge if they
disagree on the label of a single element (Hamming distance 1). This graph is called the
one-inclusion graph of $C$ (Haussler et al., 1994). Note that each edge is naturally labeled
by the single dimension/element on which the incident concepts disagree (see Fig. 1).

Figure 1: Table and one-inclusion graph of an extremal class $C$ of VC dimension 2. The
reduction $C^{x_2} = \{00000, 10000, 11000, 11100\}$ has the domain $\{x_1, x_3, x_4, x_5, x_6\}$.
Note that each concept in $C^{x_2}$ corresponds to an edge labeled $x_2$. Similarly
$C^{\{x_3, x_4\}}$ consists of the single concept $\{1100\}$ over the domain $\{x_1, x_2, x_5, x_6\}$.
Note that this concept corresponds to the single cube of $C$ with dimension set
$\{x_3, x_4\}$.


Restrictions and samples. We denote the restriction/sample of a concept $c$ onto $S \subseteq
\mathrm{dom}(c)$ as $c|S$. This concept has the restricted domain $S$ and labels this domain consistently
with $c$. Essentially, the concept $c|S$ is obtained by removing from row $c$ in the table all columns
not in $S$. The restriction/set of samples of an entire class $C$ onto $S \subseteq \mathrm{dom}(C)$ is denoted as
$C|S$. A table for $C|S$ is produced by simply removing all columns not in $S$ from the table
for $C$ and collapsing identical rows.² Also, the one-inclusion graph for the restriction $C|S$ is
now a subgraph of the Boolean hypercube with $|S|$ dimensions instead of the full dimension
$|\mathrm{dom}(C)|$. We also use $C - S$ as shorthand for $C|(\mathrm{dom}(C) \setminus S)$ (since the columns labeled
with $S$ are removed from the table). Note that the subdomain $S \subseteq \mathrm{dom}(C)$ induces an
equivalence relation on $C$: Two concepts $c, c' \in C$ are equivalent iff $c|S = c'|S$. Thus there is
one equivalence class per concept of $C|S$.

2. We define $c|\emptyset = \emptyset$. Note that $C|\emptyset = \{\emptyset\}$ if $C \neq \emptyset$ and $\emptyset$ otherwise.

Cubes. A concept class $B$ is called a cube if for some subset $S$ of the domain $\mathrm{dom}(B)$,
the restriction $B|S$ is the set of all concepts over the domain $S$ and the class $B - S$
contains a single concept. We denote this single concept by $\mathrm{tag}(B)$. In this case, we say
that $S$ is the dimension set of $B$ (denoted as $\dim(B)$). For example, if $B$ contains two
concepts that are incident to an edge labeled $x$ then $B$ is a cube with $\dim(B) = \{x\}$. We
say that $B$ is a cube of concept class $C$ if $B$ is a cube that is a subset of $C$. We say that $B$
is a maximal cube of $C$ if there exists no other cube of $C$ which strictly contains $B$. When
the dimensions are clear from the context, a concept is described as a bit string of
length $|\mathrm{dom}(C)|$. Similarly a cube $B$ is described as an expression in $\{0, 1, *\}^{|\mathrm{dom}(C)|}$, where
the dimensions of $\dim(B)$ are the *’s and the remaining bits form the concept $\mathrm{tag}(B)$.

Reductions. In addition to the restriction it is common to define a second operation
on concept classes. We will describe this operation using cubes. The reduction $C^S$ is a
concept class on the domain $\mathrm{dom}(C) \setminus S$ which has one concept per cube with dimensions
set $S$:

$$C^S := \{\mathrm{tag}(B) : B \text{ is a cube of } C \text{ such that } \dim(B) = S\}.$$

The reduction with respect to a single dimension $x$ is denoted as $C^x$. See Fig. 1 for some
examples.
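The restriction and reduction operations can be made concrete with a short sketch. It assumes concepts are stored as bit strings and domain points are referred to by 0-based index; the helper names `restriction` and `reduction` and the small toy class are illustrative choices, not taken from the paper.

```python
from itertools import product

def restriction(C, S):
    """C|S: keep only the columns in S (a list of indices) and collapse identical rows."""
    return {"".join(c[i] for i in S) for c in C}

def reduction(C, S):
    """C^S: the tags of all cubes of C whose dimension set is S."""
    n = len(next(iter(C)))                       # assumes C is non-empty
    rest = [i for i in range(n) if i not in S]   # coordinates outside S
    tags = set()
    for c in C:
        # Build the cube through c that frees exactly the coordinates in S.
        cube = set()
        for bits in product("01", repeat=len(S)):
            v = list(c)
            for i, b in zip(S, bits):
                v[i] = b
            cube.add("".join(v))
        if cube <= C:                            # the whole cube lies in C
            tags.add("".join(c[i] for i in rest))
    return tags

# Toy downward-closed class over three domain points (index 0 plays the role of x1).
C = {"000", "100", "010", "001", "110"}
r_x1 = reduction(C, [0])        # edges labeled x1: 000-100 and 010-110
r_x1x2 = reduction(C, [0, 1])   # the single square **0
r_x1x3 = reduction(C, [0, 2])   # no cube frees x1 and x3 together
```

Each concept of a reduction is the tag of one cube, so the size of `reduction(C, S)` counts the cubes of `C` with dimension set `S`.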


Shattering and strong shattering. There are two important properties associated
with subsets $S$ of the domain of $C$. We say that $S \subseteq \mathrm{dom}(C)$ is shattered by $C$ if $C|S$ is
the set of all concepts over the domain $S$. Furthermore, $S$ is strongly shattered by $C$
if $C$ has a cube with dimensions set $S$. We use $s(C)$ to denote all shattered sets of $C$ and
$\mathrm{st}(C)$ to denote all strongly shattered sets, respectively. Clearly, both $s(C)$ and $\mathrm{st}(C)$ are
closed under the subset relation, and $\mathrm{st}(C) \subseteq s(C)$.

The following theorem is the result of accumulated work by different authors, and parts
of it were rediscovered independently several times (Pajor, 1985; Bollobas and Radcliffe,
1995; Dress, 1997; Anstee et al., 2002).


Theorem 1 (Sandwich Theorem) Let $C$ be a concept class. Then

$$|\mathrm{st}(C)| \le |C| \le |s(C)|.$$


This theorem has several proofs
(see (Moran, 2012) for more details). One approach, which is also used in proving the
Sauer-Shelah Lemma (Sauer, 1972; Shelah, 1972), is via down-shifting. We now sketch this
approach. In a down-shifting step we pick a dimension $x \in \mathrm{dom}(C)$, and every $c \in C$ is
replaced by its $x$-neighbor (i.e. the concept $c'$ which disagrees with $c$ only on $x$) if the
following conditions hold: (i) $c(x) = 1$, and (ii) the $x$-neighbor of $c$ does not belong to $C$.
One can easily verify that if $C'$ is obtained from $C$ by a down-shifting step, then $|C'| = |C|$,
$s(C') \subseteq s(C)$, and $\mathrm{st}(C') \supseteq \mathrm{st}(C)$. Eventually, after enough down-shifting steps have been
performed,³ the resulting class becomes downward-closed (see Example 2 below). For such
classes the cardinalities of $s(C)$, $\mathrm{st}(C)$, and $C$ are all equal. This implies the inequalities
in the Sandwich Theorem for the original class.

3. In fact one step on each $x \in \mathrm{dom}(C)$ suffices (Moran, 2012).
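The down-shifting step can be sketched in a few lines. The sketch assumes concepts are 0/1 tuples indexed from 0, and the function names are hypothetical. It applies one step per coordinate to the class of even-parity vectors on three points.

```python
from itertools import product

def down_shift(C, x):
    """One down-shifting step on coordinate x."""
    shifted = set()
    for c in C:
        neighbor = c[:x] + (0,) + c[x + 1:]
        # Replace c by its x-neighbor only if c(x) = 1 and the neighbor is missing from C.
        if c[x] == 1 and neighbor not in C:
            shifted.add(neighbor)
        else:
            shifted.add(c)
    return shifted

def is_downward_closed(C):
    """Lowering any single 1 of a concept to 0 must stay inside C."""
    return all(c[:i] + (0,) + c[i + 1:] in C
               for c in C for i in range(len(c)) if c[i] == 1)

n = 3
C = {c for c in product((0, 1), repeat=n) if sum(c) % 2 == 0}
D = C
for x in range(n):          # one step per coordinate
    D = down_shift(D, x)
```

In this small instance the shifted class `D` has the same size as `C` and is downward-closed, while `C` itself is not.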

The inequalities in this theorem can be strict: Let $C \subseteq \{0,1\}^n$ be such that $C$ contains
all boolean vectors with an even number of 1’s. Then $\mathrm{st}(C)$ contains only the empty set
and $s(C)$ contains all subsets of $\{1, \ldots, n\}$ of size at most $n - 1$. Thus in this example,

$$|\mathrm{st}(C)| = 1, \quad |C| = 2^{n-1}, \quad |s(C)| = 2^n - 1.$$
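The parity example can be checked by brute force for a small $n$. The sketch below assumes concepts are 0/1 tuples and enumerates every subset of the domain, so it is only practical for tiny classes. The function names are illustrative.

```python
from itertools import combinations, product

def restrict(C, S):
    """C|S: project every concept onto the coordinates in S."""
    return {tuple(c[i] for i in S) for c in C}

def shattered_sets(C, n):
    """s(C): index sets S on which C realizes all 2^|S| patterns."""
    return {S for k in range(n + 1) for S in combinations(range(n), k)
            if len(restrict(C, S)) == 2 ** k}

def strongly_shattered_sets(C, n):
    """st(C): index sets S such that C contains a cube with dimension set S."""
    result = set()
    for k in range(n + 1):
        for S in combinations(range(n), k):
            for c in C:
                # The cube through c that frees exactly the coordinates in S.
                cube = set()
                for bits in product((0, 1), repeat=k):
                    v = list(c)
                    for i, b in zip(S, bits):
                        v[i] = b
                    cube.add(tuple(v))
                if cube <= C:
                    result.add(S)
                    break
    return result

n = 3
even = {c for c in product((0, 1), repeat=n) if sum(c) % 2 == 0}
s = shattered_sets(even, n)
st = strongly_shattered_sets(even, n)
```

For $n = 3$ the three cardinalities are 1, 4 and 7, matching $1$, $2^{n-1}$ and $2^n - 1$.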

The VC dimension (Vapnik and Chervonenkis, 1971; Blumer et al., 1989) is defined as:

$$\mathrm{VCdim}(C) = \max\{|S| : S \in s(C)\}.$$

Note that by the definition of the VC dimension:

$$s(C) \subseteq \{S \subseteq \mathrm{dom}(C) : |S| \le \mathrm{VCdim}(C)\}.$$

Hence, an easy consequence of Theorem 1 is that for every concept class $C$, we have

$$|C| \le \sum_{i=0}^{\mathrm{VCdim}(C)} \binom{|\mathrm{dom}(C)|}{i}.$$

This is the well-known Sauer-Shelah Lemma (Sauer, 1972; Shelah, 1972).


2.2 Definition of extremal classes and examples 

Maximum classes are defined as concept classes which satisfy the Sauer-Shelah inequality
with equality. Analogously, extremal classes are defined as concept classes which satisfy
the inequalities⁴ in the Sandwich Theorem with equality: A concept class $C$ is extremal if
for every shattered set $S$ of $C$ there is a cube of $C$ with dimension set $S$, i.e. $s(C) = \mathrm{st}(C)$.
Note that complementing the bits in a column of the table representing $C$ does not affect
the sets $s(C)$ and $\mathrm{st}(C)$, so extremality is preserved. Also in the one-inclusion graph, only
the labels of the vertices are affected by such column complementations.

Every maximum class is an extremal class. Moreover, maximum classes of VC dimension
d are precisely the extremal classes for which the shattered sets consist of all subsets
of the domain of size up to d. The other direction does not hold - there are extremal classes 
that are not maximum. All the following examples are extremal but not maximum. 

Example 1 Consider the concept class $C$ over the domain $\{x_1, \ldots, x_6\}$ given in Fig. 1.
In this example

$$\mathrm{st}(C) = s(C) = \{\emptyset, \{x_1\}, \{x_2\}, \{x_3\}, \{x_4\}, \{x_5\}, \{x_6\},
\{x_1, x_2\}, \{x_1, x_3\}, \{x_1, x_4\}, \{x_1, x_5\}, \{x_1, x_6\}, \{x_2, x_3\},
\{x_2, x_4\}, \{x_3, x_4\}, \{x_4, x_5\}, \{x_4, x_6\}, \{x_5, x_6\}\}.$$

This example also demonstrates the cubical structure of extremal classes.

4. There are two inequalities in the Sandwich Theorem, but every class which satisfies one of them with
equality also satisfies the other with equality (see Theorem 2).

Example 2 (Downward-closed classes)

A standard example of a maximum class of VC dimension d is

$$C = \{c \in \{0,1\}^n : \text{the number of 1’s in } c \text{ is at most } d\}.$$

This is simply the Hamming ball of radius d around the all-0’s concept. A natural generalization
of such classes are downward-closed classes. We say that $C$ is downward closed if
for all $c \in C$ and for all $c' \le c$, also $c' \in C$. Here $c' \le c$ means that for every $x \in \mathrm{dom}(C)$,
$c'(x) \le c(x)$. It is not hard to verify that every downward-closed class is extremal.
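As a quick numerical illustration of Example 2, the following sketch builds a Hamming ball and compares its size with the Sauer-Shelah bound. The values $n = 5$ and $d = 2$ and the helper names are arbitrary illustrative choices.

```python
from itertools import product
from math import comb

def sauer_shelah_bound(n, d):
    """Upper bound on the size of a class of VC dimension d over n domain points."""
    return sum(comb(n, i) for i in range(d + 1))

def is_downward_closed(C):
    """Lowering any single 1 of a concept to 0 must stay inside C."""
    return all(c[:i] + (0,) + c[i + 1:] in C
               for c in C for i in range(len(c)) if c[i] == 1)

n, d = 5, 2
# Hamming ball of radius d around the all-0's concept.
ball = {c for c in product((0, 1), repeat=n) if sum(c) <= d}
bound = sauer_shelah_bound(n, d)
```

Here `len(ball)` equals `bound`, which is what it means for the ball to be a maximum class, and the ball is also downward closed.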

Example 3 (Hyperplane arrangements in a convex domain)

Another standard set of examples for maximum classes comes from geometry (see e.g. (Gartner
and Welzl, 1994)). Let $H$ be an arrangement of hyperplanes in $\mathbb{R}^d$. For each hyperplane
$p_i \in H$, pick one of the half-spaces determined by $p_i$ to be its positive side and the other its
negative side. The hyperplanes of $H$ cut $\mathbb{R}^d$ into open regions (cells). Each cell $c$ defines a
binary mapping with domain $H$:

$$c(p_i) = \begin{cases} 1 & \text{if } c \text{ is in the positive side of } p_i \\ 0 & \text{if } c \text{ is in the negative side of } p_i. \end{cases}$$

It is known that if the hyperplanes are in general position, then the set $C$ of all cells is a
maximum class of VC dimension d.

Consider the following generalization of these classes: Let $K \subseteq \mathbb{R}^d$ be a convex set.
Instead of taking the vectors corresponding to all of the cells, take only those that correspond
to cells that intersect $K$:

$$C_K = \{c : c \text{ corresponds to a cell that intersects } K\}.$$

$C_K$ is extremal. In fact, for $C_K$ to be extremal it is not even required that the hyperplanes
are in general position. It suffices to require that no $d + 1$ hyperplanes have a non-empty
intersection (e.g. parallel hyperplanes are allowed). Fig. 2 illustrates such a class $C_K$ in
the plane. These classes were studied in (Moran, 2012).


Interestingly, extremal classes also arise in the context of graph theory: 


Example 4 (Edge-orientations which preserve connectivity (Kozma and Moran, 2013))

Let $G = (V, E)$ be an undirected simple graph and let $\vec{E}$ be a fixed reference orientation.
Now an arbitrary orientation of $E$ is a function $d : E \to \{0,1\}$: if $d(e) = 0$ then $e$ is
oriented as in $\vec{E}$ and if $d(e) = 1$ then $e$ is oriented opposite to $\vec{E}$. Now let $s, t \in V$ be two
fixed vertices, and consider all orientations of $E$ for which there exists a directed path from
$s$ to $t$. The corresponding class of orientations $E \to \{0,1\}$ is an extremal concept class
over the domain $E$.

Figure 2: An extremal class that corresponds to the cells of a hyperplane arrangement of a
convex set. An arrangement of 4 lines is given which partitions the plane into 10
cells. Each cell corresponds to a binary vector which specifies its location relative
to the lines. For example the cell corresponding to 1010 is on the positive sides of
lines 1 and 3 and on the negative sides of lines 2 and 4. Here the convex set $K$ is an
ellipse and the extremal concept class consisting of the cells the ellipse intersects
is $C_K = \{1000, 1010, 1011, 1111, 1110, 0010, 0000, 0110\}$ (the cells 0100, 0111 are
not intersected by the ellipse). The class $C_K$ here has VC dimension 2. Note
that its shattered sets of size 2 are exactly the pairs of lines whose intersection
point lies in the ellipse $K$.

Moreover, the extremality of this class yields the following result in graph theory: The
number of orientations for which there exists a directed path from $s$ to $t$ equals the number
of subgraphs for which there exists an undirected path from $s$ to $t$. For a more thorough
discussion and other examples of extremal classes related to graph orientations see (Kozma
and Moran, 2013).
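The counting identity for orientations can be checked by brute force on a tiny graph. The sketch below uses a triangle with $s = 0$ and $t = 2$. The graph, the vertex numbering and the function name `reachable` are illustrative assumptions, and a subgraph is taken to mean a subset of the edges.

```python
from itertools import product

def reachable(arcs, s, t):
    """Is t reachable from s along the given directed arcs?"""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for a, b in arcs:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return t in seen

edges = [(0, 1), (1, 2), (0, 2)]   # reference orientation: each edge points u -> v
s, t = 0, 2

orientations = 0   # orientations with a directed path from s to t
subgraphs = 0      # edge subsets with an undirected path from s to t
for d in product((0, 1), repeat=len(edges)):
    # Read d as an orientation: bit 0 keeps the reference direction, bit 1 flips it.
    arcs = [(u, v) if bit == 0 else (v, u) for (u, v), bit in zip(edges, d)]
    orientations += reachable(arcs, s, t)
    # Read the same bit vector as a subgraph: keep exactly the edges with bit 1.
    kept = [e for e, bit in zip(edges, d) if bit == 1]
    both_ways = kept + [(v, u) for (u, v) in kept]
    subgraphs += reachable(both_ways, s, t)
```

On the triangle both counters end at the same value, as the identity predicts.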


Example 5 (A general construction of a maximal class that is extremal but not 
maximum) 

Take a $k$-dimensional cube and glue to each of its vertices an edge of a new distinct
dimension. The resulting class has $2^{k+1}$ concepts and $n = 2^k + k$ dimensions. Let $C$ be the
complement of that class.


Claim 1 $C$ is an extremal maximal class of VC dimension $n - 2$ which is not maximum.

We prove this claim in the appendix. Note that $|C| = 2^n - 2^{k+1} = 2^n - 2^k - 2^k$ and maximum
classes of VCdim $d = n - 2$ over $n$ dimensions have size $2^n - n - 1 = 2^n - 2^k - k - 1$. So
the maximum classes of VCdim $n - 2$ are by $2^k - k - 1$ larger than the constructed extremal
maximal class of VCdim $n - 2$.


2.3 Basic properties of extremal classes 


Extremal classes have a rich combinatorial structure (see (Moran, 2012) and references
within for more details). We discuss some parts of it that are relevant to compression
schemes.

The following theorem provides alternative characterizations of extremal classes: 


Theorem 2 (Bollobas and Radcliffe (1995); Bandelt et al. (2006)) The following
statements are equivalent:

1. $C$ is extremal, i.e. $s(C) = \mathrm{st}(C)$.

2. $|s(C)| = |\mathrm{st}(C)|$.

3. $|\mathrm{st}(C)| = |C|$.

4. $|C| = |s(C)|$.

5. $\{0,1\}^n \setminus C$ is extremal.


The following theorem shows that the property of “being an extremal class” is preserved
under standard operations. It was also proven independently by several authors (e.g. (Bollobas
and Radcliffe, 1995; Bandelt et al., 2006)).


Theorem 3 Let $C$ be any extremal class, $S \subseteq \mathrm{dom}(C)$, and $B$ be any cube such that
$\mathrm{dom}(B) = \mathrm{dom}(C)$. Then $C - S$ and $C^S$ are extremal concept classes over the domain
$\mathrm{dom}(C) \setminus S$ and $B \cap C$ is an extremal concept class over the domain $\mathrm{dom}(C)$.


Note that if $C$ is maximum then $C - S$ and $C^S$ are also maximum, but $B \cap C$ is not
necessarily maximum. This is an example of the advantage extremal classes have over the
more restricted notion of maximum classes.

Interestingly, the fact that extremal classes are preserved under intersecting with cubes
yields a rather simple proof (communicated to us by Ami Litman) of the fact that every
extremal class is “distance preserving”. This property also holds for maximum classes (Gartner
and Welzl, 1994); however, the proof for extremal classes is much simpler than the
previous proof for maximum classes (given in (Gartner and Welzl, 1994)):


Theorem 4 (Greco (1998)) Let $C$ be any extremal class. Then for every $c_0, c_1 \in C$, the
distance between $c_0$ and $c_1$ in the one-inclusion graph of $C$ equals the Hamming distance
between $c_0$ and $c_1$.


Proof Assume towards contradiction that this is not the case. Among all possible pairs
$c_0, c_1 \in C$ for which there is no such path (i.e. no path in the one-inclusion graph of
length equal to their Hamming distance), pick a pair $c_0, c_1$ of minimal Hamming
distance. Let $B$ be the minimal cube over the domain $\mathrm{dom}(C)$ which contains both $c_0$ and
$c_1$. So the dimensions set of $B$ is $\dim(B) = \{x : c_0(x) \neq c_1(x)\}$ and $|\dim(B)| \ge 2$.

We first claim that by the minimality criterion according to which $c_0, c_1$ were chosen,
there cannot be any other concept in $B \cap C$ except $c_0$ and $c_1$. Without loss of generality
assume that for all $x \in \dim(B)$, $c_0(x) = 0$ and $c_1(x) = 1$ (otherwise we can flip the bits
of entire columns without affecting the distances between concepts). If there were
another concept $c \in B \cap C$, then $c|\dim(B)$ must have at least one 0 and at least one 1. By the
minimality according to which $c_0, c_1$ were chosen, there must exist a path between $c_0$ and
$c$ of length equal to the number of 1’s in $c|\dim(B)$. Similarly, there must exist a path between
$c$ and $c_1$ of length equal to the number of 0’s. The combined path would have the length of
the Hamming distance between $c_0$ and $c_1$, contradicting the choice of the pair. So $B \cap C$ does
not contain another concept. Therefore $B \cap C = \{c_0, c_1\}$ and this completes the proof of the claim.

Next we observe that by Theorem 3, $B \cap C$ must be extremal. However we claim
that $B \cap C$ is not extremal: Since there is no edge between its two concepts $c_0$ and $c_1$,
$\mathrm{st}(B \cap C) = \{\emptyset\}$ and therefore $|\mathrm{st}(B \cap C)| = 1 < 2 = |B \cap C|$. This means that $B \cap C$ is not
extremal, which is a contradiction. ■
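The distance-preserving property of Theorem 4 can be tested directly on small classes with a breadth-first search in the one-inclusion graph. This is a minimal sketch assuming concepts are bit strings. The two toy classes and the function names are illustrative.

```python
from collections import deque

def hamming(a, b):
    """Number of positions on which two concepts disagree."""
    return sum(x != y for x, y in zip(a, b))

def graph_distance(C, a, b):
    """BFS distance between a and b in the one-inclusion graph of C (None if disconnected)."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u]
        for v in C:
            if v not in dist and hamming(u, v) == 1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def is_distance_preserving(C):
    """Graph distance equals Hamming distance for every pair of concepts."""
    return all(graph_distance(C, a, b) == hamming(a, b) for a in C for b in C)

downward_closed = {"000", "100", "010", "001", "110"}   # extremal (downward closed)
two_corners = {"00", "11"}                               # not extremal: no edge at all

preserving = is_distance_preserving(downward_closed)
broken = is_distance_preserving(two_corners)
```

The second class is exactly the situation ruled out in the proof: two concepts at Hamming distance 2 whose minimal cube contains no other concept of the class.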


The following lemma brings out the special cubical structure of extremal classes. We
will use it to prove the correctness of the compression scheme given in the following section.
It shows that if $B_1$ and $B_2$ are two maximal cubes of an extremal class $C$ then their
dimensions sets $\dim(B_1)$ and $\dim(B_2)$ are incomparable.

Lemma 5 Let $B_1$ and $B_2$ be two cubes of an extremal class $C$. If $B_1$ is maximal, then

$$\dim(B_1) \subseteq \dim(B_2) \implies B_1 = B_2.$$

Proof Assume towards contradiction that $\dim(B_2) \supseteq \dim(B_1) := D$ and $B_2 \neq B_1$. The
cube $B_1^D$ contains the single concept $\mathrm{tag}(B_1)$ in $C^D$, and the cube $B_2^D$ is a cube of $C^D$ with
dimensions set $\dim(B_2) \setminus D$. Since $C^D$ is extremal, $C^D$ must be connected (by Theorem
4). Therefore there is a path in $C^D$ between the concept $\mathrm{tag}(B_1)$ and some concept in the
cube $B_2^D$. This means there is some edge $e$ incident to the concept $\mathrm{tag}(B_1)$ in $C^D$. This edge $e$
is a one-dimensional cube of $C^D$ labeled with some dimension $x$. This cube with dimension
set $\dim(e) = \{x\}$ expands to a cube of $C$ with dimension set $\dim(B_1) \cup \{x\}$ which contains
the cube $B_1$ of $C$. This contradicts the maximality of cube $B_1$ of $C$. ■


One concise way to represent an extremal class is as the union of its maximal cubes.
With this representation, the extremal class of Fig. 3 is described by the expression

**0*00 + 1***00 + 1101*0 + 01010*,

where “+” stands for union. Note that the dimension sets of the cubes are marked as *’s
and for the class to be extremal, the dimension sets must be incomparable.
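A union-of-cubes expression can be expanded mechanically. The sketch below assumes the four cubes `**0*00`, `1***00`, `1101*0` and `01010*` over six domain points. The helper names `expand` and `dimension_set` are illustrative.

```python
from itertools import product

def expand(cube):
    """All concepts (bit strings) of a cube written over the alphabet {0, 1, *}."""
    stars = [i for i, ch in enumerate(cube) if ch == "*"]
    out = set()
    for bits in product("01", repeat=len(stars)):
        v = list(cube)
        for i, b in zip(stars, bits):
            v[i] = b
        out.add("".join(v))
    return out

def dimension_set(cube):
    """Positions of the *'s, i.e. the dimension set of the cube."""
    return frozenset(i for i, ch in enumerate(cube) if ch == "*")

cubes = ["**0*00", "1***00", "1101*0", "01010*"]
C = set().union(*(expand(b) for b in cubes))

dims = [dimension_set(b) for b in cubes]
# No dimension set may be contained in another one.
incomparable = all(not (dims[i] <= dims[j])
                   for i in range(len(dims)) for j in range(len(dims)) if i != j)
```

The cubes overlap, so the class has fewer concepts than the sum of the cube sizes: here the union contains 14 distinct bit strings.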

3. A labeled compression scheme for extremal classes 


Algorithm 1 (A labeled compression scheme for extremal classes)

Let $C$ be an extremal class. The compression map:

• Input: A sample $s$ of $C$.

• Output: A subsample $s' = s|\dim(B)$, where $B$ is any maximal cube of $C|\mathrm{dom}(s)$
that contains the sample $s$.

The reconstruction map:

• Input: A sample $s'$ of size at most $\mathrm{VCdim}(C)$.

• Output: Any concept $h$ which is consistent with $s'$ on $\mathrm{dom}(s')$ and belongs to a cube
$B$ of $C$ with dimensions set $\mathrm{dom}(s')$.


Let $C$ be a concept class. On a high level, a sample compression scheme for $C$ compresses
every sample of $C$ to a subsample of size at most $k$ and this subsample represents a
hypothesis on the entire domain of $C$ that must be consistent with the original sample.
More formally, a labeled compression scheme of size $k$ for $C$ consists of a compression map
$\kappa$ and a reconstruction map $\rho$. The domain of the compression map consists of all samples
from concepts in $C$: For each sample $s$, $\kappa$ compresses it to a subsample $s'$ of size at most $k$.
The domain of the reconstruction function $\rho$ is the set of all samples of $C$ of size at most
$k$. Each such sample is used by $\rho$ to reconstruct a concept $h$ with $\mathrm{dom}(h) = \mathrm{dom}(C)$. The
sample compression scheme must satisfy that for all samples $s$ of $C$,

$$\rho(\kappa(s))|\mathrm{dom}(s) = s.$$

The sample compression scheme is said to be proper if the reconstructed hypothesis $h$
always belongs to the original concept class $C$.

A proper labeled compression scheme for extremal classes of size at most the VC dimension
is given in Algorithm 1. Let $C$ be an extremal concept class and $s$ be a sample
of $C$. In the compression phase the algorithm finds any maximal cube $B$ of $C|\mathrm{dom}(s)$ that
contains the sample $s$ and compresses $s$ to the subsample determined by the dimensions
set of that maximal cube. Note that the size of the dimension set (and the compression
scheme) is bounded by the VC dimension.

How should we reconstruct? Consider all concepts of $C$ that are consistent with the
sample $s$:

$$H_s = \{h \in C : h|\mathrm{dom}(s) = s\}.$$

Correctness means that we need to reconstruct to one of those concepts. Let $s'$ be the
input for the reconstruction function and let $D := \mathrm{dom}(s')$. During the reconstruction, the
domain $\mathrm{dom}(s)$ of the original sample $s$ is not known. All that is known at this point is
that $D$ is the dimensions set of a maximal cube $B$ of $C|\mathrm{dom}(s)$ that contained the sample
$s$. The reconstruction map of the algorithm outputs a concept in the following set:

$$H_B := \{h \in C : h \text{ lies in a cube } B' \text{ of } C \text{ such that }
\dim(B') = \dim(B) \text{ and } h|\dim(B) = s|\dim(B)\}.$$

For the correctness of the compression scheme it suffices to show that for all choices of the
maximal cube $B$ of $C|\mathrm{dom}(s)$, $H_B$ is non-empty and a subset of $H_s$. The following lemma
guarantees the non-emptiness.

Lemma 6 Let $C$ be an extremal class and let $D \subseteq \mathrm{dom}(C)$ be the dimensions set of some
cube of $C|\mathrm{dom}(s)$. Then $D$ is also the dimensions set of some cube of $C$.

Proof Clearly the dimension set $D$ is shattered by $C|\mathrm{dom}(s)$ and therefore it is also
shattered by $C$. By the extremality of $C$, $D$ is also strongly shattered by it, and thus there
exists a cube $B$ of $C$ with dimensions set $D$. ■

The second lemma shows that for each choice of the maximal cube $B$, $H_B \subseteq H_s$.

Lemma 7 Let $s$ be a sample of an extremal class $C$, let $B$ be any maximal cube of $C|\mathrm{dom}(s)$
that contains $s$, and let $D$ denote the dimensions set of $B$. Then for any cube $B'$ of $C$ with
$\dim(B') = D$, the concept $h \in B'$ that is consistent with $s$ on $D$ is also consistent with $s$
on $\mathrm{dom}(s) \setminus D$.

Figure 3: The one-inclusion graph of an extremal concept class $C$ is given on the left.
Consider the sample $s = 110$ over the domain points $x_2, x_4, x_5$. There are 4 concepts $c \in C$ consistent with this
sample (the octagonal vertices), i.e. $H_s = \{111100, 110100, 010100, 010101\}$.
There are 2 maximal cubes of $C|\mathrm{dom}(s)$ (graph on right) that contain the sample
$s$ (in grey) with dimension sets $\{x_5\}$ and $\{x_2, x_4\}$, respectively. Let $B$ be the
maximal cube with dimension set $D = \{x_2, x_4\}$. There are 3 cubes of $C$ (on left)
with the same dimension set $D$. Each contains a concept $h$ (shaded grey) that
is consistent with the original sample on $D$, i.e. $h|D = s|D = 11$ (over $x_2, x_4$), and therefore
$H_B = \{111100, 110100, 010100\}$. For the correctness we need that $H_B$ (grey
nodes on left) is non-empty and a subset of $H_s$ (octagon nodes on left). Note
that in this case $H_B$ is a strict subset.











Proof Since $B$ is a cube with dimension set $D$, $B|_{\mathrm{dom}(s) \setminus D}$ contains the single concept
$\mathrm{tag}(B)$.

Let $B'$ be any cube of $C$ with $\dim(B') = D$, and let $h$ be the concept in $B'$ which is
consistent with $s$ on $D$. Now consider the cube $B'|_{\mathrm{dom}(s)}$. We will show that $B'|_{\mathrm{dom}(s)} = B$.
This will finish the proof as it shows that both $h|_{\mathrm{dom}(s)}$ and $s$ belong to $B'|_{\mathrm{dom}(s)} = B$,
which means that $\mathrm{tag}(B) = h|_{\mathrm{dom}(s) \setminus D} = s|_{\mathrm{dom}(s) \setminus D}$. Moreover, by the definition
of $h$, $h|_D = s|_D$, and therefore $h|_{\mathrm{dom}(s)} = s$ as required.

We now show that $B'|_{\mathrm{dom}(s)} = B$. Indeed, since $B'$ is a cube of $C$ with dimension set
$D \subseteq \mathrm{dom}(s)$, the cube $B'|_{\mathrm{dom}(s)}$ is a cube of $C|_{\mathrm{dom}(s)}$ with the same dimension set $D$.
Thus the dimension set of $B'|_{\mathrm{dom}(s)}$ contains the dimension set of the maximal cube $B$ of
$C|_{\mathrm{dom}(s)}$. Therefore, since $C|_{\mathrm{dom}(s)}$ is extremal (by an earlier theorem), it follows from an
earlier lemma that $B'|_{\mathrm{dom}(s)} = B$. ■
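
The sets $H_s$ and $H_B$ can be computed by brute force on a small class, which gives a quick sanity check of Lemmas 6 and 7. The Python sketch below is illustrative only. It assumes concepts are 0/1 tuples indexed by dimension, and it uses the class of all concepts with at most two ones over four dimensions (a maximum, hence extremal, class) as a stand-in example that does not appear in the paper. All function names are hypothetical.

```python
from itertools import combinations, product

def cube(c, dims):
    """All concepts obtained from c by varying the coordinates in dims."""
    out = {c}
    for x in dims:
        out |= {v[:x] + (1 - v[x],) + v[x + 1:] for v in out}
    return out

def maximal_cube_dims(C, c):
    """Dimension sets of the maximal cubes of C that contain c."""
    n = len(c)
    dims = [frozenset(S) for k in range(n + 1)
            for S in combinations(range(n), k) if cube(c, S) <= C]
    return [S for S in dims if not any(S < T for T in dims)]

def check_sample(C, dom, s):
    """For a sample s over the sorted tuple dom, check that every H_B is
    non-empty and contained in H_s. Returns False on the first failure."""
    restricted = {tuple(c[x] for x in dom) for c in C}
    H_s = {h for h in C if tuple(h[x] for x in dom) == s}
    for S in maximal_cube_dims(restricted, s):
        D = [dom[i] for i in S]                  # back to original dimensions
        label = {dom[i]: s[i] for i in S}
        H_B = {h for h in C
               if cube(h, D) <= C and all(h[x] == label[x] for x in D)}
        if not H_B or not H_B <= H_s:
            return False
    return True

# Illustrative extremal class: at most two ones over four dimensions.
C = {c for c in product((0, 1), repeat=4) if sum(c) <= 2}
all_ok = all(check_sample(C, dom, tuple(c[x] for x in dom))
             for c in C
             for k in range(5)
             for dom in combinations(range(4), k))
```

On a class that is not extremal the check can fail: for the two-concept class {00, 11} and the sample 0 on the first dimension, the restricted class contains an edge that no cube of the original class projects onto, so $H_B$ is empty.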


4. Unlabeled sample compression schemes and related combinatorial conjectures

The labeled compression scheme of the previous section compresses each sample of the 
concept class to a (labeled) subsample and this subsample is guaranteed to represent a 
hypothesis that is consistent with the entire original sample. Such a labeled compression 
scheme (of size equal to the VC dimension d) was first found for maximum classes. In the
previous section, we generalized this scheme to extremal classes. 

Alternate “unlabeled” compression schemes have also been found for maximum classes 
and a natural question is whether these schemes again generalize to extremal classes. As we 
shall see there is an excellent match between the combinatorics of unlabeled compression 
schemes and extremal classes. The existence of such schemes remains open at this point. 
We can however relate their existence to some natural conjectures about extremal classes. 

An unlabeled compression scheme compresses a sample $s$ of the concept class $C$ to
an (unlabeled) subset of the domain of the sample $s$. In other words, in an unlabeled
compression scheme the labels of the original sample are not used by the reconstruction
map. The size of the compression scheme is now the maximum size of the subset that
the sample is compressed to. Consider an unlabeled compression scheme for $C$ of size
$\mathrm{VCdim}(C)$. For a moment restrict your attention to samples of $C$ over some fixed domain
$S \subseteq \mathrm{dom}(C)$. Each such sample is a concept in the restriction $C|_S$. Note that two
different concepts in $C|_S$ must be compressed to different subsets of $S$: if they
were compressed to the same subset, its reconstruction would be inconsistent with
one of them. For maximum classes, the number of concepts in $C|_S$ is exactly the number of
subsets of $S$ of size up to the VC dimension. Intuitively, this “tightness” makes unlabeled
compression schemes combinatorially rich and interesting.
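
The tightness for maximum classes can be checked numerically. The short Python sketch below assumes a standard example of a maximum class that is not taken from the paper: all 0/1 concepts over $n$ points with at most $d$ ones. The helper name `restrict` and the chosen values of `n`, `d` and `S` are hypothetical.

```python
from itertools import product
from math import comb

def restrict(C, S):
    """Restriction C|S: the set of distinct label patterns of C on S."""
    return {tuple(c[x] for x in S) for c in C}

n, d = 5, 2
# Assumed example of a maximum class of VC dimension d.
C = {c for c in product((0, 1), repeat=n) if sum(c) <= d}
S = (0, 2, 3, 4)

num_concepts = len(restrict(C, S))                               # |C|S|
num_small_subsets = sum(comb(len(S), k) for k in range(d + 1))   # subsets of S of size <= d
```

Both counts agree here, so an unlabeled scheme of size $d$ has no spare subsets of $S$: every subset of size at most $d$ must represent exactly one concept of $C|_S$.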

Figure 4: Unlabeled compression scheme based on peeling. The vertices of the extremal
class $C$ from the left figure of Fig. 3 were peeled in a top down order (see table on
left). Note that the highest vertex is always a corner of the remaining extremal
class below. The resulting representation sets are underlined in the left table.
For example the second concept is represented by the set $\{x_1, x_2, x_4\}$ and the
last by the empty set.

Now consider the domain $D = \{x_2, x_4, x_5\}$ and a sample $s = 110$ over this
domain. Partition $C$ into equivalence classes such that concepts in the same class
are consistent on $D$. In the table on the right, we reordered and segmented the
concepts of $C$ by their equivalence classes. Each class corresponds to a member
of $C|_D$. In Lemma 9 we show that each class contains exactly one concept $c$ such
that $r(c) \subseteq D$ (marked in the table). Each sample $s$ of $C|_D$ is compressed to the
unique subset of $D$ in the equivalence class that represents this consistent concept
(also marked). In the reconstruction, each representation set is reconstructed
to the concept it represents. In particular the sample $s$ associated with the first
class is compressed to $\{x_2, x_4\} \subseteq D$, which represents the consistent concept
111100, and the “unlabeled subsample” $\{x_2, x_4\}$ is reconstructed to this concept.


Previous unlabeled compression schemes for maximum classes were based on “representation
maps”. For maximum classes these are one-to-one mappings between $C$ and subsets
of $\mathrm{dom}(C)$ of size at most $\mathrm{VCdim}(C)$. Representation maps were used in the following
way: Each sample $s$ is compressed to a subset of $\mathrm{dom}(s)$ which represents a hypothesis
consistent with $s$, and each subset of size at most $\mathrm{VCdim}(C)$ of $\mathrm{dom}(C)$ is reconstructed
to the hypothesis it represents. Clearly, not every one-to-one mapping between $C$ and
subsets of $\mathrm{dom}(C)$ of size at most $\mathrm{VCdim}(C)$ yields an unlabeled compression scheme in
this manner, and finding a good representation map (or proving that one exists) became
the focus of many previous works.

For maximum classes, representation maps $r$ have been found that map (one-to-one)
the concept class $C$ to all subsets of $\mathrm{dom}(C)$ of size up to the VC dimension of $C$. The
key combinatorial property that enabled finding representation maps for maximum classes
was a “non clashing” condition (Kuzmin and Warmuth, 2007). This property was used to
show that for any sample $s$ of $C$ there is exactly one concept $c$ that is consistent with $s$ and
satisfies $r(c) \subseteq \mathrm{dom}(s)$. This immediately implies an unlabeled compression scheme based on non
clashing representation maps: Compress to the unique subset of the domain of the sample
that represents a concept consistent with the given sample.

We will show below that representation maps naturally generalize to extremal classes:
$r$ must now map (one-to-one) the extremal class $C$ to its shattered sets $s(C)$. This is
natural since for extremal classes $|C| = |s(C)|$. We will see that again, if the non clashing
condition holds, then for any sample $s$ of $C$ there is exactly one concept $c$ that is consistent
with $s$ and satisfies $r(c) \subseteq \mathrm{dom}(s)$.

For maximum classes, such representation maps were first shown to exist via a recursive
construction (Kuzmin and Warmuth, 2007). Alternate representation maps were also proposed
in (Kuzmin and Warmuth, 2007) based on a certain greedy “peeling” algorithm that
iteratively assigns a representation to a concept and removes this concept from the class.
The correctness of the representation maps based on peeling was finally established in
(Rubinstein and Rubinstein, 2012). In this section, we show that the existence of representation
maps based on peeling hinges on certain natural and concise properties of extremal classes.
However, establishing these conjectured properties of extremal classes remains open.


Representation maps. For any concept class $C$ a representation map is any one-to-one
mapping from concepts to subsets of the domain, i.e. $r : C \to \mathcal{P}(\mathrm{dom}(C))$. We say that
$c \in C$ is represented by the representation set $r(c)$. Furthermore we say that two different
concepts $c, c'$ clash with respect to $r$ if they are consistent with each other on the union of
their representation sets, i.e. $c|_{r(c) \cup r(c')} = c'|_{r(c) \cup r(c')}$. If no two concepts clash
then we say that $r$ is non clashing.


Example 6 (Non clashing maps based on disagreements) For an arbitrary concept
class $C$ and $c_1, c_2 \in C$, let $\mathrm{dis}(c_1, c_2)$ be the set of all dimensions on which $c_1, c_2$ disagree,
i.e. $\mathrm{dis}(c_1, c_2) = \{x \in \mathrm{dom}(C) : c_1(x) \ne c_2(x)\}$. Now let $c_0 : \mathrm{dom}(C) \to \{0, 1\}$ be a fixed
“reference” concept and define a representation map for class $C$ as $r(c) := \mathrm{dis}(c, c_0)$. We
leave it to the reader to verify that $r$ is non clashing.
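
The clash condition and the disagreement-based map of Example 6 translate directly into code. The following Python sketch is a minimal illustration under the assumption that concepts are 0/1 tuples and representation maps are dictionaries from concepts to frozensets of dimensions. The function names, the small class `C` and the reference concept `c0` are hypothetical choices, not taken from the paper.

```python
from itertools import combinations

def clash(c1, c2, r):
    """True if c1 and c2 agree on every dimension in r[c1] | r[c2]."""
    return all(c1[x] == c2[x] for x in r[c1] | r[c2])

def is_non_clashing(r):
    """True if no two distinct concepts in the domain of r clash."""
    return not any(clash(c1, c2, r) for c1, c2 in combinations(r, 2))

def disagreement_map(C, c0):
    """Example 6: r(c) is the set of dimensions where c differs from c0."""
    return {c: frozenset(x for x in range(len(c)) if c[x] != c0[x]) for c in C}

# An arbitrary small class; the reference concept need not belong to it.
C = {(0, 0, 1, 1), (1, 0, 1, 0), (1, 1, 1, 1), (0, 1, 0, 0), (1, 1, 0, 1)}
c0 = (0, 1, 1, 0)
r = disagreement_map(C, c0)
```

The reason the check succeeds for any class is short: two distinct concepts differ on some dimension $x$, so exactly one of them differs from $c_0$ at $x$, which places $x$ in the union of their representation sets.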


Example 7 (A non clashing representation map for distance preserving classes)

Let $C$ be a distance preserving class, that is, for every $u, v \in C$, the distance between $u, v$
in the one-inclusion graph of $C$ equals their Hamming distance. For every $c \in C$, define

$$\deg_C(c) = \{x \in \mathrm{dom}(C) : c \text{ is incident to an } x\text{-edge in the one-inclusion graph of } C\}.$$

The representation map $r(c) := \deg_C(c)$ has the property that for every $c \ne c' \in C$, $c$ and
$c'$ disagree on $r(c)$. To see this, note that since $C$ is distance preserving, any shortest path from
$c$ to $c'$ in $C$ traverses exactly the dimensions on which $c$ and $c'$ disagree. In particular,
the first edge leaving $c$ in this path traverses a dimension $x$ for which $c(x) \ne c'(x)$. By the
definition of $\deg_C(c)$ we have that $x \in \deg_C(c)$ and indeed $c$ and $c'$ disagree on $\deg_C(c)$.

In fact, this gives a stronger property for distance preserving classes, which is summarized
in the following lemma. This lemma will be useful in our analysis.

Lemma 8 Let $C$ be a distance preserving class and let $c \in C$. Then $\deg_C(c)$ is a teaching
set for $c$ with respect to $C$. That is, for all $c' \in C$:

$$c' \ne c \implies \exists x \in \deg_C(c) : c(x) \ne c'(x).$$
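
Lemma 8 can be checked by brute force on a small distance preserving class. The Python sketch below assumes concepts are 0/1 tuples and uses the class of all concepts with at most two ones over four dimensions as an illustrative example not taken from the paper. All function names are hypothetical.

```python
from collections import deque
from itertools import product

def flip(c, x):
    """The concept that differs from c exactly on dimension x."""
    return c[:x] + (1 - c[x],) + c[x + 1:]

def deg(C, c):
    """Dimensions x such that c has an x-edge in the one-inclusion graph of C."""
    return {x for x in range(len(c)) if flip(c, x) in C}

def graph_distances(C, src):
    """Breadth-first distances from src in the one-inclusion graph of C."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for x in deg(C, u):
            v = flip(u, x)
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def is_distance_preserving(C):
    """Graph distance equals Hamming distance for every pair of concepts."""
    for u in C:
        dist = graph_distances(C, u)
        for v in C:
            if dist.get(v) != sum(a != b for a, b in zip(u, v)):
                return False
    return True

def is_teaching_set(C, c, T):
    """Every other concept of C disagrees with c somewhere on T."""
    return all(any(c[x] != other[x] for x in T) for other in C if other != c)

C = {c for c in product((0, 1), repeat=4) if sum(c) <= 2}
lemma_holds = is_distance_preserving(C) and all(
    is_teaching_set(C, c, deg(C, c)) for c in C)
```

The distance preserving assumption matters: in the class {00, 11} neither concept has an incident edge, so the degree set is empty and cannot distinguish the two concepts.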

Clearly the representation map $r(c) = \deg_C(c)$ is non clashing. The following lemma
establishes that certain non clashing representation maps immediately give unlabeled
compression schemes:

Lemma 9 Let $r$ be any representation map that is a bijection between an extremal class
$C$ and $\mathrm{st}(C)$. Then the following two statements are equivalent:

1. $r$ is non clashing.

2. For every sample $s$ of $C$, there is exactly one concept $c \in C$ that is consistent with $s$
and satisfies $r(c) \subseteq \mathrm{dom}(s)$.

Based on this lemma it is easy to see that a non clashing representation mapping $r$ for an
extremal concept class $C$ (a bijection between $C$ and $\mathrm{st}(C)$) defines a compression scheme as
follows (see Algorithm 2 and an example in Fig. 4). For any sample $s$ of $C$ we compress $s$ to
the unique representative $r(c)$ such that $c$ is consistent with $s$ and $r(c) \subseteq \mathrm{dom}(s)$.
Reconstruction is even simpler, since $r$ is bijective: If $s$ is compressed to the set $r(c)$, then
we reconstruct $r(c)$ to the concept $c$.

Note that the representation set $r(c)$ of a concept $c$ is always an unlabeled set from
$\mathrm{st}(C)$. However, we could also compress to the labeled subsamples $c|_{r(c)}$. The
labels in this type of scheme carry no additional information and are redundant.

Algorithm 2 (An unlabeled compression scheme from a representation map)

The compression map.

Input: A sample $s$ of $C$.

1. Let $c \in C$ be the unique concept which satisfies (i) $c|_{\mathrm{dom}(s)} = s$, and (ii) $r(c) \subseteq \mathrm{dom}(s)$.

2. Output $r(c)$.

The reconstruction map.

Input: A set $S' \in \mathrm{st}(C)$.

1. Since $r$ is a bijection between $C$ and $\mathrm{st}(C)$, there is a unique $c$ such that $r(c) = S'$.

2. Output $c$.
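
The two maps of Algorithm 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: concepts are 0/1 tuples, a sample is a dictionary from dimensions to labels, and the example class (at most `d` ones over `n` points) with the map `r(c) = support of c` is a hypothetical instance chosen because that map is a non clashing bijection onto the shattered sets of this particular class.

```python
from itertools import combinations, product

def compress(C, r, sample):
    """Compression map. sample maps each dimension in dom(s) to its label."""
    dom = set(sample)
    matches = [c for c in C
               if all(c[x] == y for x, y in sample.items()) and r[c] <= dom]
    if len(matches) != 1:
        raise ValueError("expected exactly one consistent concept with r(c) inside dom(s)")
    return r[matches[0]]

def reconstruct(r, rep_set):
    """Reconstruction map: invert the bijection r."""
    inverse = {rep: c for c, rep in r.items()}
    return inverse[rep_set]

# Assumed example: all concepts with at most d ones, represented by their support.
n, d = 5, 2
C = {c for c in product((0, 1), repeat=n) if sum(c) <= d}
r = {c: frozenset(x for x in range(n) if c[x] == 1) for c in C}

def round_trip_ok(C, r, n):
    """Every sample of C is reconstructed to a concept consistent with it."""
    for c in C:
        for k in range(n + 1):
            for dom in combinations(range(n), k):
                sample = {x: c[x] for x in dom}
                h = reconstruct(r, compress(C, r, sample))
                if any(h[x] != y for x, y in sample.items()):
                    return False
    return True
```

For instance, the sample that labels dimensions 1 and 3 with 1 and dimension 4 with 0 is compressed to the unlabeled set {1, 3}, which is reconstructed to the concept 01010.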


Proof [of Lemma 9]

2 $\Rightarrow$ 1: Proof by contrapositive. Assume $\neg 1$, that is: $\exists c, c' \in C$, $c \ne c'$, such that
$c|_{r(c) \cup r(c')} = c'|_{r(c) \cup r(c')}$. Then let $s = c|_{r(c) \cup r(c')}$. Clearly both $c$ and $c'$ are consistent with
$s$ and $r(c), r(c') \subseteq \mathrm{dom}(s)$. This negates 2.

1 $\Rightarrow$ 2: We will show that 1 implies the following equivalent form of 2: For all
sample domains $D \subseteq \mathrm{dom}(C)$ and samples $s \in C|_D$, there is exactly one concept $c \in C$
that is consistent with $s$ and satisfies $r(c) \subseteq D$. Recall that any domain $D \subseteq \mathrm{dom}(C)$ partitions
$C$ into equivalence classes where each class contains all concepts of $C$ consistent with a
sample from $C|_D$. We need to show that each equivalence class has a unique concept in
$R := \{c : r(c) \subseteq D\}$. See Fig. 4 for an example. We split our goal into two parts:

(a) $C|_D = R|_D$, i.e. for every $s \in C|_D$ there is at least one $c \in R$ such that $s = c|_D$, and

(b) $|R|_D| = |R|$, i.e. for each sample $s' \in R|_D$ there is at most one $c \in R$ such that $s' = c|_D$.

We first prove Part (b). Clearly $|R|_D| \le |R|$. Furthermore, the non-clashing condition
(Part 1 of the lemma) implies that any distinct concepts $c_1, c_2 \in R$ disagree on
$r(c_1) \cup r(c_2) \subseteq D$ and therefore $|R|_D| = |R|$.

Since $R|_D \subseteq C|_D$, the set equality $R|_D = C|_D$ of Part (a) is implied by the fact that
both sets have the same cardinality:

$$\begin{aligned}
\bigl|C|_D\bigr| &= \bigl|s(C|_D)\bigr| && \text{(since } C|_D \text{ is extremal)} \\
&= \bigl|s(C) \cap \mathcal{P}(D)\bigr| && \text{(holds for every concept class } C \text{ and } D \subseteq \mathrm{dom}(C)\text{)} \\
&= |R| && \text{(since } r : C \to s(C) \text{ is a bijection)} \\
&= \bigl|R|_D\bigr| && \text{(by Part (b)).}
\end{aligned}$$
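
The chain of equalities can be verified numerically on a small instance. The Python sketch below is only a sanity check under assumptions that are not from the paper: the extremal class is taken to be all 0/1 concepts over four points with at most two ones, and $r$ maps each concept to its support, which is a non clashing bijection onto the shattered sets of this class. Function names are hypothetical.

```python
from itertools import combinations, product

def restrict(C, D):
    """Restriction C|D as a set of label patterns."""
    return {tuple(c[x] for x in D) for c in C}

def shattered_sets(C, n):
    """All S (as frozensets) such that C|S contains all 2^|S| patterns."""
    return {frozenset(S) for k in range(n + 1)
            for S in combinations(range(n), k)
            if len(restrict(C, S)) == 2 ** k}

n, d = 4, 2
C = {c for c in product((0, 1), repeat=n) if sum(c) <= d}
r = {c: frozenset(x for x in range(n) if c[x] == 1) for c in C}
sh = shattered_sets(C, n)

def chain_holds(D):
    """Check |C|D| = |s(C) ∩ P(D)| = |R| = |R|D| for the domain D."""
    R = {c for c in C if r[c] <= set(D)}
    sizes = (len(restrict(C, D)),                 # |C|D|
             sum(1 for S in sh if S <= set(D)),   # |s(C) ∩ P(D)|
             len(R),                              # |R|
             len(restrict(R, D)))                 # |R|D|
    return len(set(sizes)) == 1

all_domains_ok = all(chain_holds(D) for k in range(n + 1)
                     for D in combinations(range(n), k))
```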

For a more detailed proof of 1 $\Rightarrow$ 2, assume $\neg 2$, i.e. there is a sample $y$ of $C$ with $\mathrm{dom}(y) = Y$ for
which there are either zero or (at least) two consistent concepts $c$ for which $r(c) \subseteq Y$. If two
concepts $c, c' \in C$ are consistent with $y$ and $r(c), r(c') \subseteq Y$, then $c|_{r(c) \cup r(c')} = c'|_{r(c) \cup r(c')}$
(which is $\neg 1$). Assume now that there is no concept $c$ consistent with some sample $y$ of $C$
for which $r(c) \subseteq Y$. Note that

$$\begin{aligned}
\bigl|C|_Y\bigr| &= \bigl|\mathrm{st}(C|_Y)\bigr| && \text{(since } C|_Y \text{ is extremal)} \\
&= \bigl|\mathrm{st}(C) \cap \mathcal{P}(Y)\bigr| \\
&= |\{c : r(c) \subseteq Y\}| && \text{(since } r : C \to \mathrm{st}(C) \text{ is a bijection).}
\end{aligned}$$

In other words the number of samples of $C$ with domain $Y$ equals the number of concepts with
a representation set in $Y$. Partition $C$ into equivalence classes where two concepts $c, c'$ are
equivalent if $c|_Y = c'|_Y$ (see Fig. 4 for a running example). Thus, each equivalence
class corresponds to a sample of $C$ with domain $Y$. Each concept is identified by its
representation set $r(c)$ and the number of equivalence classes equals $|\{c : r(c) \subseteq Y\}|$. By
assumption, all concepts $c$ in the equivalence class of sample $y$ have $r(c) \not\subseteq Y$. Therefore
by a pigeonhole argument there must be an equivalence class with two distinct concepts
$c_1, c_2 \in C$ for which $r(c_1), r(c_2) \subseteq Y$. These two concepts clash and again $\neg 1$ is implied. □

Once we have a valid representation mapping for some extremal concept class $C$, we
can easily derive a valid mapping for any restriction $C|_A$ of the class by compressing every
restricted concept. This is discussed in the following corollary.

Corollary 10 For any extremal class $C$ and $A \subseteq \mathrm{dom}(C)$, if $r$ is a representation mapping
for $C$ then a representation mapping for $C|_A$ can be constructed as follows. For any
$c \in C|_A$, let $r_A(c)$ be the representative of the unique concept $c' \in C$ such that $c'|_A = c$
and $r(c') \subseteq A$.

Proof The construction of the mapping for $C|_A$ essentially tells us to treat the concept $c$
as a sample from $C$ and to compress it. Thus we can apply Lemma 9 to see that $r_A(c) \subseteq A$
is always uniquely defined. Now we need to show that $r_A$ satisfies the conditions of the
Main Definition. Since the representatives $r_A(c)$ are subsets of $A$, the non-clashing property
for the representation mapping for $C|_A$ follows from the non-clashing condition for
$r$ for $C$. The bijection property follows from a counting argument like the one used in the
proof of Lemma 9, since $|C|_A| = |\{r(c) : r(c) \subseteq A\}|$. ■


Corner peeling yields good representation maps. We now present a natural conjecture
concerning extremal classes and show how this conjecture can be used to construct
non clashing representation maps. A concept $c$ of an extremal class $C$ is a corner of $C$ if
$C \setminus \{c\}$ is extremal. By an earlier lemma, for each $S \subseteq \mathrm{dom}(C)$ there is at most one
maximal cube with dimension set $S$, and if $S$ is the dimension set of a non-maximal cube,
then there are at least two cubes with this dimension set. Therefore

$$\mathrm{st}(C \setminus \{c\}) = \mathrm{st}(C) \setminus \{\dim(B) : B \text{ is a maximal cube of } C \text{ containing } c\}.$$

For $C \setminus \{c\}$ to be extremal, $|\mathrm{st}(C \setminus \{c\})|$ must be $|C| - 1$ (by an earlier theorem) and therefore $c$
is a corner of an extremal class $C$ iff $c$ lies in exactly one maximal cube of $C$.
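
The two descriptions of a corner can be compared by brute force. The Python sketch below is an illustration under assumptions not made in the paper: concepts are 0/1 tuples, extremality is tested directly as equality of the shattered and strongly shattered sets, and the example class consists of all concepts over four points with at most two ones. All function names are hypothetical.

```python
from itertools import combinations, product

def restrict(C, S):
    """Restriction C|S as a set of label patterns."""
    return {tuple(c[x] for x in S) for c in C}

def cube(c, dims):
    """All concepts obtained from c by varying the coordinates in dims."""
    out = {c}
    for x in dims:
        out |= {v[:x] + (1 - v[x],) + v[x + 1:] for v in out}
    return out

def subsets(n):
    return [frozenset(S) for k in range(n + 1) for S in combinations(range(n), k)]

def is_extremal(C, n):
    """Extremal: the shattered sets and the strongly shattered sets coincide."""
    shattered = {S for S in subsets(n) if len(restrict(C, sorted(S))) == 2 ** len(S)}
    strongly = {S for S in subsets(n) if any(cube(c, S) <= C for c in C)}
    return shattered == strongly

def num_maximal_cubes(C, c):
    """Number of maximal cubes of C that contain c."""
    dims = [S for S in subsets(len(c)) if cube(c, S) <= C]
    return sum(1 for S in dims if not any(S < T for T in dims))

n = 4
C = {c for c in product((0, 1), repeat=n) if sum(c) <= 2}
corners_by_definition = {c for c in C if is_extremal(C - {c}, n)}
corners_by_cubes = {c for c in C if num_maximal_cubes(C, c) == 1}
```

In this example both computations return the six concepts with exactly two ones. A concept with a single one lies in three maximal cubes, and removing it leaves a class that still shatters every pair but no longer strongly shatters all of them.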


Conjecture 11 Every non empty extremal class $C$ has at least one corner.

In (Kuzmin and Warmuth, 2007) essentially the same conjecture was presented for maximum
classes. For these latter classes, the conjecture was finally proved in (Rubinstein
and Rubinstein, 2012). This conjecture has also been proven for other special cases such
as extremal classes of VC dimension at most 2 (Litman and Moran, 2012; Meszaros and
Ronyai, 2014). In fact Litman and Moran (2012) proved a stronger statement: For every
two extremal classes $C_1 \subseteq C_2$ such that $\mathrm{VCdim}(C_2) \le 2$ and $|C_2 \setminus C_1| \ge 2$, there exists an
extremal class $C$ such that $C_1 \subsetneq C \subsetneq C_2$ (i.e. $C$ is a strict subset of $C_2$ and a strict superset
of $C_1$). Indeed, this statement is stronger, as by repeatedly picking a larger extremal
class $C_1 \subsetneq C_2$ eventually a $c \in C_2$ is obtained such that $C_2 \setminus \{c\}$ is extremal. For general
extremal classes this stronger statement also remains open.

Conjecture 12 For every two extremal classes $C_1 \subseteq C_2$ with $|C_2 \setminus C_1| \ge 2$ there exists an
extremal class $C$ such that $C_1 \subsetneq C \subsetneq C_2$.


Let us return to the more basic Conjecture 11. How does this conjecture yield a
representation map? Define an order on $C$ (see footnote 5)

$$c_1, c_2, \ldots, c_{|C|}$$

such that for every $i$, $c_i$ is a corner of $C_i = \{c_j : j \ge i\}$, and define a map $r : C \to \mathrm{st}(C)$
such that $r(c_i) = \dim(B_i)$ where $B_i$ is the unique maximal cube of $C_i$ that $c_i$ belongs to.
We claim that $r$ is a representation map. Indeed, $r$ is a one-to-one mapping from $C$ to
$\mathrm{st}(C)$ (and since $C$ is extremal $r$ is a bijection). To see that $r$ is non clashing, note that
$r(c_i) = \dim(B_i) = \deg_{C_i}(c_i)$. $C_i$ is extremal and therefore distance preserving (by an earlier theorem).
Thus, Lemma 8 implies that $r(c_i)$ is a teaching set of $c_i$ with respect to $C_i$. This implies
that $r$ is indeed non clashing.
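
The peeling construction can be sketched as a short greedy loop. The Python code below is a hedged illustration, not the authors' procedure: it assumes concepts are 0/1 tuples, it recognises a corner by checking that the cube spanned by the incident edges of a concept lies inside the remaining class (a restatement of "lies in exactly one maximal cube"), and it raises an error if no corner is found, since the existence of corners is only conjectured in general. The example class (at most two ones over four points) and all function names are hypothetical.

```python
from itertools import combinations, product

def flip(c, x):
    """The concept that differs from c exactly on dimension x."""
    return c[:x] + (1 - c[x],) + c[x + 1:]

def deg(C, c):
    """Dimensions of the edges incident to c in the one-inclusion graph of C."""
    return frozenset(x for x in range(len(c)) if flip(c, x) in C)

def cube(c, dims):
    """All concepts obtained from c by varying the coordinates in dims."""
    out = {c}
    for x in dims:
        out |= {flip(v, x) for v in out}
    return out

def is_corner(C, c):
    # In an extremal class, c lies in exactly one maximal cube iff the cube
    # spanned by its incident edges is contained in the class.
    return cube(c, deg(C, c)) <= C

def peel(C):
    """Greedy corner peeling: returns r with r[c_i] = deg of c_i in C_i."""
    remaining, r = set(C), {}
    while remaining:
        corner = next((c for c in sorted(remaining) if is_corner(remaining, c)), None)
        if corner is None:
            raise ValueError("no corner found; peeling is stuck")
        r[corner] = deg(remaining, corner)
        remaining.remove(corner)
    return r

def is_non_clashing(r):
    """No two distinct concepts agree on the union of their representation sets."""
    return not any(all(c1[x] == c2[x] for x in r[c1] | r[c2])
                   for c1, c2 in combinations(r, 2))

C = {c for c in product((0, 1), repeat=4) if sum(c) <= 2}
r = peel(C)
```

In this example the loop first peels the six concepts with two ones and then the remaining star, and the resulting representation sets are exactly the subsets of size at most 2, which are the shattered sets of this class.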


5. Discussion 


We studied the conjecture of Floyd and Warmuth (1995) which asserts that every concept
class has a sample compression scheme of size linear in its VC dimension. We extended
the family of concept classes for which the conjecture is known to hold by showing that
every extremal class has a sample compression scheme of size equal to its VC dimension. We
discussed the fact that extremal classes form a natural and rich generalization of maximum
classes for which the conjecture had been proved before (Floyd and Warmuth, 1995).

5. Such orderings are related to the recursive teaching dimension which was studied by
Doliwa et al. (2010).


We further related basic conjectures concerning the combinatorial structure of extremal
classes with the existence of optimal unlabeled compression schemes. These connections
may also be used in the future to provide a better understanding of the combinatorial
structure of extremal classes, which is considered to be incomplete by several authors
(Bollobas and Radcliffe, 1995; Greco, 1998; Ronyai and Meszaros, 2011).


Our compression schemes for extremal classes yield another direction of attacking the
general conjecture of Floyd and Warmuth: it is enough to show that an arbitrary maximal
concept class of VC dimension $d$ can be covered by $\exp(d)$ extremal classes of VC dimension
$O(d)$. Note it takes an additional $O(d)$ bits to specify which of the $\exp(d)$ extremal classes is
used in the compression.
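
The bit count in the last sentence can be made explicit. The following one-line calculation assumes that $\exp(d)$ is read literally as $e^d$; if the intended meaning is $2^{O(d)}$, the same argument gives $O(d)$ bits with a different constant.

```latex
\left\lceil \log_2 e^{d} \right\rceil
  = \left\lceil d \log_2 e \right\rceil
  \le 1.443\, d + 1
  = O(d).
```

Under this reading, a sample would be encoded by the index of the covering extremal class (about $1.443\,d$ bits) together with a subsample of size $O(d)$ produced by the scheme for that class.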
Appendix A. Proof of Claim

To prove this claim we use the following simple fact.

Lemma 13 (Moran (2012); Anstee et al. (2002); Bollobas and Radcliffe (1995))
Let $C, \bar{C}$ be two complementing concept classes over domain $X$. Then for every $Y \subseteq X$
exactly one of the following holds.

1. $C$ strongly shatters $Y$.

2. $\bar{C}$ shatters $X \setminus Y$.
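
The dichotomy can be checked exhaustively on a tiny domain. The Python sketch below assumes the second alternative reads "the complement of $C$ shatters $X \setminus Y$", which is the form in which the two alternatives are exact negations of each other. Concepts are 0/1 tuples and all function names are hypothetical.

```python
from itertools import combinations, product

def shatters(C, Y):
    """C shatters Y if all 2^|Y| patterns appear in the restriction C|Y."""
    return len({tuple(c[x] for x in Y) for c in C}) == 2 ** len(Y)

def strongly_shatters(C, Y, n):
    """C strongly shatters Y if C contains a cube whose dimension set is Y."""
    rest = [x for x in range(n) if x not in Y]
    for tag in product((0, 1), repeat=len(rest)):
        base = dict(zip(rest, tag))
        if all(tuple({**base, **dict(zip(Y, free))}[x] for x in range(n)) in C
               for free in product((0, 1), repeat=len(Y))):
            return True
    return False

def exactly_one_holds(C, n):
    """Check the dichotomy for every Y, with the complement taken in {0,1}^n."""
    complement = set(product((0, 1), repeat=n)) - set(C)
    for k in range(n + 1):
        for Y in combinations(range(n), k):
            rest = tuple(x for x in range(n) if x not in Y)
            if strongly_shatters(C, Y, n) == shatters(complement, rest):
                return False
    return True

# Exhaustive check over all 256 concept classes on a 3-point domain.
n = 3
points = list(product((0, 1), repeat=n))
lemma_ok = all(exactly_one_holds(set(sub), n)
               for k in range(len(points) + 1)
               for sub in combinations(points, k))
```

The check passes for every class, including the empty class and the full cube, because "some assignment to $X \setminus Y$ has all its extensions in $C$" is the negation of "every assignment to $X \setminus Y$ has an extension outside $C$".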


With this lemma at hand, note that if $C$ is extremal then for every $Y \subseteq X$, either $C$
strongly shatters $Y$ or $\bar{C}$ strongly shatters $X \setminus Y$.

Going back to our $C$ from the construction, it is easy to verify that $\bar{C}$ (and therefore
$C$) is extremal, because glueing an edge of a new dimension to a concept of an extremal
class preserves extremality. Thus, by the above lemma $\mathrm{VCdim}(C) = d = n - 2$ (because
every subset of size 1 is strongly shattered by $\bar{C}$ but there are subsets of size 2 that are
not strongly shattered by $\bar{C}$). To see why $C$ is maximal we again use the above lemma
and observe that every concept $c$ which is removed from $\bar{C}$ removes a set of size 1 (the set
containing the unique dimension of the edge glued to $c$) from the strongly shattered sets
of $\bar{C}$. This means that a set of size $n - 1$ is added to the shattered sets of $C$ and the VC
dimension of $C$ is increased from $n - 2$ to $n - 1$.